Creating modular and reusable DSL textual 
syntax definitions with Grammatic/ANTLR 



Andrey Breslav 

St. Petersburg State University of Information Technology, Mechanics and Optics 

abreslav@gmail.com



Abstract. In this paper we present Grammatic - a tool for textual
syntax definition. Grammatic serves as a front end for parser generators
(and other tools) and brings modularity and reuse to their development
artifacts. It adapts techniques for separation of concerns from
Aspect-Oriented Programming to grammars and uses templates for grammar
reuse. We illustrate the usage of Grammatic with a case study: bringing
separation of concerns to the ANTLR parser generator. This is achieved
without building an AST, a common but time- and memory-consuming
technique for separating semantic actions from a grammar definition.



1 Introduction 

When adopting the concept of Domain-Specific Languages (DSLs, see [1]) and
developing textual syntax for them, we use syntax-related tools extensively.
Most of these tools use context-free grammars to define language syntax; we
will call them "grammarware engineering tools" after the paper [5]. All these
tools use grammar definitions and (according to the same paper) there is a
strong need for applying software engineering practices in this area. In the
present paper we address the problem of modularity and reuse of grammar
definitions.

All grammarware engineering tools have to support reuse of their input
artifacts, but implementing it takes the tools' authors considerable effort.
We examined popular parser generators [3,4,5,6,7,8,9,10] and found that only
three of them [8,9,10] have strong reuse capabilities, though even these
could be improved in this respect. And grammarware engineering is not limited
to parser generators. Maybe this is natural: when working on a new tool
addressing some syntax-related problem (e.g. implementing a new parsing
algorithm or a new concept of a pretty-printer), grammar definition reuse is
probably one of the last things a developer has on his or her to-do list,
since it is a complicated feature which is mostly irrelevant to what he or
she is working on. In any case, it is not likely to appear in the first
version.

In the UNIX world this problem is solved by the following principle [11]:
Make each program do one thing well. It would probably be ideal if all the
grammarware tools could use a common grammar definition language with a
common solution for reuse problems. Then it would be easy to support modular
and reusable syntax definitions, and in addition all the tools would have a
common data format through which they could interoperate with each other.

To make a step towards this solution we propose a common grammar defi- 
nition language, named Grammatic, that provides strong modularity and reuse 
capabilities out of the box. 

One of the main problems in making it suitable for a wide range of tools is
that each tool requires different information to be attached to a grammar.
Almost no tool takes a mere EBNF definition as input; each one extends it
with some extra data. To cope with this, Grammatic allows extending a grammar
definition with arbitrary metadata which is represented in a common format
and attached to the grammar externally for the sake of separation of concerns.

An author of a new tool may use Grammatic as follows: 

- Use Grammatic's grammar definition language to define grammars. 

- Define extensions for grammar definitions in terms of metadata. 

- Write a custom back end for processing the definition using Grammatic's API.

This allows the developer to implement modularity and reuse features easily and 
concentrate on his or her tool's specific functionality. 

But there are many great tools already. They are rarely strong in terms of
reuse and even more rarely interoperate well with each other. To benefit from
both such tools and Grammatic we can use the latter as a front end, namely we
can:

- Identify extensions which the tool adds to a pure grammar definition lan- 
guage. 

- Decide how to express those extensions with metadata attached to grammar 
elements. 

- Write a generator which converts a properly annotated Grammatic grammar 
definition into the tool's input language. 

After doing this we can use modular grammar definitions throughout the devel- 
opment process. A not necessarily modularized input for the tool in question is 
generated only when needed and is never modified by hand. 

In this paper we present a case study on the latter case: we demonstrate using
Grammatic as a front end for the ANTLR parser generator [4]. ANTLR is very
popular due to its flexibility, clarity and the many target languages it
supports. On the other hand, it lacks modularity and supports reuse rather
weakly.

The paper is organized as follows: in the next section we give a short overview
of Grammatic's main features. Section 3 gives an overview of the case study.
Subsection 3.1 describes the simple way of attaching Grammatic to ANTLR
mentioned above, and subsection 3.2 describes the creation of a more usable
though somewhat less general parser generator with Grammatic and ANTLR. Some
concluding remarks are given in section 4.

2 Grammatic's features overview 

Here we give an overview of four languages constituting Grammatic's core. These 
languages are used to define modular grammars and attach metadata to them.

2.1 Grammar definitions

The grammar definition language allows defining a context-free grammar as a
set of symbols, each of which is associated with a set of productions (in
concrete syntax we separate productions by "||"). Here is an example grammar
definition which we will use throughout this paper (it describes a simple
language of constants and typed variables with assigned values in the form of
arithmetic expressions):

const : ID '=' sum ';' ; 

varDecl : type ID ('=' sum)? ';' ; 

type : ID; 

sum : mult ('+' mult)* ;
mult : factor ('*' factor)* ;
factor : NUM || ID || '(' sum ')' ;
ALPHA : ['a'--'z' 'A'--'Z' '_'] ;
ID : ALPHA (ALPHA | ['0'--'9'])* ;
NUM : ['0'--'9']+ ;

In this example characters in single quotes represent embedded lexical
definitions. There is no separate notion of a lexical rule (since it is not
strictly required, see [12,13]) and we use the same syntax for EBNF and
regular expressions. In this grammar the symbol sum is (virtually) a
nonterminal and ID is (also virtually) a terminal, since it has only regular
expressions on the right side.

2.2 Imports and Templates 

As we said above, Grammatic's grammar definition language provides strong
reuse techniques. The ideas behind these techniques are generalized from the
ones implemented in Rats! [10], SDF [5] and LISA [5]. First we focus on
reusing grammar definitions themselves.

The most popular way of reuse is importing. A grammar definition A may be
imported into another grammar definition B. This means that all the rules of
A are inserted into B. Rules of B may refer to symbols of A - this is the way
two grammar definitions are connected.
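
The effect of an import can be sketched in Java as follows. This is only an
illustration under simplifying assumptions: the class GrammarImportSketch,
the method importInto and the representation of a grammar as a map from
symbol names to production strings are hypothetical and are not part of
Grammatic's API. In particular, appending productions when both grammars
define the same symbol is our assumption; the text above does not specify
this case.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class GrammarImportSketch {
    /**
     * Hypothetical model: symbol name -> productions kept as plain strings.
     * Returns a new grammar with the rules of target followed by the rules
     * of imported; neither argument is modified.
     */
    static Map<String, List<String>> importInto(Map<String, List<String>> target,
                                                Map<String, List<String>> imported) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> rule : target.entrySet()) {
            result.put(rule.getKey(), new ArrayList<>(rule.getValue()));
        }
        for (Map.Entry<String, List<String>> rule : imported.entrySet()) {
            // Assumption: productions of an already known symbol are appended.
            result.computeIfAbsent(rule.getKey(), k -> new ArrayList<>())
                  .addAll(rule.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> a = new LinkedHashMap<>();
        a.put("sum", Arrays.asList("mult ('+' mult)*"));
        Map<String, List<String>> b = new LinkedHashMap<>();
        b.put("const", Arrays.asList("ID '=' sum ';'"));
        // The rule of B refers to the symbol sum, which comes from A.
        System.out.println(importInto(b, a));
    }
}
```

After the import the rule for const can refer to sum, which is exactly the
connection between the two definitions described above.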

Very frequently we have to customize some of the imported rules, i.e. add
some more productions to the same symbols or replace existing productions. In
the paper [14] this is referred to as rule overriding. In Grammatic we decided
to use a more general form of this concept, namely templates.

The language of grammar templates allows creating grammar definitions with
"placeholders" which are replaced with actual objects upon template
instantiation. Placeholders may be defined for the roles of identifier,
expression, production or symbol. A template instantiation results in a
grammar object of the type specified by the template declaration. Here is an
example of a template and its usage.

Symbol binaryOperation<ID $name, Expression $sign, Expression $argument> {
    $name -> $argument ($sign $argument)*;
}

import binaryOperation<Product, '*' | '/', Factor>;
import binaryOperation<Sum, '+' | '-', Product>;

Factor
    -> NUMBER
    || ID
    || '(' Sum ')'
    ;

In this example we define a template named "binaryOperation" which builds an
infix binary operation from a symbol name, a sign and an argument expression.
Then we instantiate it twice and import the instantiation results into the
current grammar definition, so we can use the new symbol "Product" to create
"Sum" and "Sum" to define "Factor".

How can we use templates for "overriding"? We can put a customizable set of
rules into a template, provide a placeholder for the production or
subexpression that should be replaced, and then supply the right object upon
instantiation.

Symbol attributeValue<Production* $moreValueTypes> {
    AttributeValue
        -> STRING
        || ID
        || INT
        || Annotation
        || ValueSequence
        || $moreValueTypes
        ;
}

import attributeValue<
    '{{{' Expression '}}}'
>;

This defines a template for the "AttributeValue" symbol and instantiates it,
adding a new production (to use expressions as attribute values).
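
Template instantiation can be approximated in Java by placeholder
substitution. The sketch below is a deliberate simplification and not
Grammatic's implementation: the class TemplateSketch and the method
instantiate are hypothetical names, and the template body is treated as
plain text, whereas Grammatic substitutes grammar objects. With purely
textual substitution an argument that contains alternatives would have to
be parenthesized by the caller.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TemplateSketch {
    // A placeholder is a dollar sign followed by an identifier, e.g. $name.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$(\\w+)");

    /** Replaces every placeholder in body with the argument bound to it. */
    static String instantiate(String body, Map<String, String> arguments) {
        Matcher m = PLACEHOLDER.matcher(body);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = arguments.get(m.group(1));
            if (value == null) {
                throw new IllegalArgumentException("unbound placeholder $" + m.group(1));
            }
            // quoteReplacement keeps '$' and '\' in the argument literal.
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> arguments = new HashMap<>();
        arguments.put("name", "Product");
        arguments.put("sign", "'*'");
        arguments.put("argument", "Factor");
        // Prints: Product -> Factor ('*' Factor)*;
        System.out.println(instantiate("$name -> $argument ($sign $argument)*;", arguments));
    }
}
```

The same function covers the "overriding" use: a placeholder such as
$moreValueTypes simply receives the additional production as its argument.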

2.3 Metadata 

As we said above, Grammatic allows attaching arbitrary metadata to a grammar
definition in order to express various extensions used by specific tools.

Metadata annotations may be attached to a grammar, a symbol, an individual
production or a subexpression. Each annotation may contain several attributes
(name-value pairs). Attribute values may be of different types. There are
several predefined value types: ID, STRING, INTEGER, TUPLE (a number of
name-value pairs) and SEQUENCE of values and punctuation symbols.

id = someName; // ID
str = "some string"; // STRING
int = 10; // INTEGER
class = { // TUPLE (name : ID; super : ID)
    name = MyClass;
    super = Object;
};
astProduction = {{ // SEQUENCE
    ^('+' left ^('-' right 10))
}};

Users may add their own types. No attribute has any fixed semantics by itself.
Metadata is passive: tools (such as analyzers, transformers, generators etc.)
may use it according to their needs.

Even without adding custom types many things can be expressed by such
annotations. The most powerful type is SEQUENCE - it allows defining small
embedded DSLs inside Grammatic. We use such DSLs to describe complicated
custom properties (see section 3.2).



2.4 Queries 

How do we attach metadata to a grammar? In many tools it is done by directly
embedding annotations into the grammar definition. Different concerns are
thus mixed together, and this results in a problem: the system is not modular
and is hard to understand and extend.

We employ ideas from the aspect-oriented programming paradigm (AOP, see [15])
to solve this problem. In Grammatic a grammar definition itself knows nothing
about metadata. All the metadata is attached "from the outside". In AOP this
is done by defining join points which are described by point cuts [16]. A
language of point cuts is a kind of "addressing" notation - a way to find some
object. When we have found such an object we may attach metadata to it (or
perform other actions, see below).

In Grammatic we have a language analogous to AOP point cuts - we call it a
query language. For example, this query matches rules defining binary
operations:

#Op -> #Arg (#Sign #Arg)* ;

All the names here represent variables. This query matches any of the fol- 
lowing rules: 

sum : mult ('+' mult)* ; 
mult : factor ('*' factor)* ; 
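
To make the matching concrete, here is a text-level approximation of this
query in Java. It is a sketch under our own assumptions, not Grammatic's
query engine: the class BinaryOperationQuerySketch and the method match are
hypothetical, rules are matched as strings rather than as grammar objects,
and we assume that a variable used twice (#Arg) must bind the same element
both times, which the regular expression enforces with a back-reference.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BinaryOperationQuerySketch {
    // Stand-in for "#Op -> #Arg (#Sign #Arg)* ;" over rules written as
    // "name : body ;". Group 1 = Op, group 2 = Arg, group 3 = Sign.
    // The back-reference \2 requires the second #Arg to equal the first.
    private static final Pattern QUERY = Pattern.compile(
            "\\s*(\\w+)\\s*:\\s*(\\w+)\\s*\\(\\s*('[^']*')\\s+\\2\\s*\\)\\s*\\*\\s*;\\s*");

    /** Returns the variable bindings, or null if the rule does not match. */
    static Map<String, String> match(String rule) {
        Matcher m = QUERY.matcher(rule);
        if (!m.matches()) {
            return null;
        }
        Map<String, String> bindings = new LinkedHashMap<>();
        bindings.put("Op", m.group(1));
        bindings.put("Arg", m.group(2));
        bindings.put("Sign", m.group(3));
        return bindings;
    }

    public static void main(String[] args) {
        System.out.println(match("sum : mult ('+' mult)* ;"));
        System.out.println(match("mult : factor ('*' factor)* ;"));
        System.out.println(match("factor : NUM || ID || '(' sum ')' ;"));
    }
}
```

The first two calls yield bindings for Op, Arg and Sign, while the rule for
factor yields null because it does not have the required structure.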

By default a variable matches a symbol but it may match a subexpression 
or a whole production. 

Symbol $production : -> $alt:(A | B) ;






Here variables A, B and C match symbol references and Alt matches a subex- 
pression "B | C". 

We can use wildcards in queries. The following query matches immediately 
left-recursive rules: 

#Rec -> #Rec .. ;

Two dots represent a wildcard which matches an arbitrary subexpression.

We can also take metadata into account in our queries: we can restrict a
particular attribute to a certain type or value, or require an attribute's
presence or absence:

#N {
    type = Nonterminal;
    operation;
    associativity : ID;
    !commutative;
}

This query matches a symbol whose "type" attribute has the value
"Nonterminal", whose "operation" attribute is present, whose "associativity"
attribute has a value of type ID, and whose "commutative" attribute is absent.
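
The four constraints of this query can be read as a conjunction of simple
checks on a symbol's attributes. The following Java sketch illustrates that
reading; the classes AttributeQuerySketch and Value, the enum Type and the
use of a map for the attributes of one symbol are assumptions made for this
example and do not reflect Grammatic's actual data model.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class AttributeQuerySketch {
    /** A few of the predefined value types; NONE marks a valueless attribute. */
    enum Type { ID, STRING, INTEGER, NONE }

    /** Hypothetical attribute value: a type tag plus the raw data. */
    static final class Value {
        final Type type;
        final Object data;

        Value(Type type, Object data) {
            this.type = type;
            this.data = data;
        }
    }

    /** Checks the constraints of the query #N { ... } against one symbol. */
    static boolean matches(Map<String, Value> attributes) {
        Value type = attributes.get("type");
        Value associativity = attributes.get("associativity");
        return type != null && "Nonterminal".equals(type.data)       // type = Nonterminal;
                && attributes.containsKey("operation")                // operation;
                && associativity != null
                && associativity.type == Type.ID                      // associativity : ID;
                && !attributes.containsKey("commutative");            // !commutative;
    }

    public static void main(String[] args) {
        Map<String, Value> attributes = new LinkedHashMap<>();
        attributes.put("type", new Value(Type.ID, "Nonterminal"));
        attributes.put("operation", new Value(Type.NONE, null));
        attributes.put("associativity", new Value(Type.ID, "left"));
        System.out.println(matches(attributes)); // true
        attributes.put("commutative", new Value(Type.NONE, null));
        System.out.println(matches(attributes)); // false
    }
}
```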

2.5 Aspects 

When a query selects some objects from a grammar definition, we can attach 
some metadata to them. 

#Rec -> #Rec .. ;
[[
    Rec {
        leftRecursive;
    };
]];

This rule adds a "leftRecursive" attribute (with no value) to all the symbols
matched by the Rec variable of this query. A set of such rules constitutes an
"aspect". Many aspects (independent or not) may be applied to a single
grammar, and one aspect may be applied to many grammars, since our queries are
not tied to concrete objects but only to grammar structure.

Aspects themselves can be generally reusable: as we said above, the query
language does not require "hard linking" to grammar objects; these objects are
located by their structural context and properties. The rule in our example
constitutes a reusable aspect - we can use it on any grammar.
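
The following Java sketch mimics this aspect: a query part finds immediately
left-recursive symbols and an assignment part records the "leftRecursive"
attribute in a table kept outside the grammar. All names (AspectSketch,
leftRecursiveSymbols, apply) and the representation of productions as lists
of element names are assumptions made for illustration only; Grammatic's
own API may look quite different.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class AspectSketch {
    /** Query part: symbols with a production that starts with the symbol itself. */
    static Set<String> leftRecursiveSymbols(Map<String, List<List<String>>> grammar) {
        Set<String> matched = new LinkedHashSet<>();
        for (Map.Entry<String, List<List<String>>> rule : grammar.entrySet()) {
            for (List<String> production : rule.getValue()) {
                if (!production.isEmpty() && production.get(0).equals(rule.getKey())) {
                    matched.add(rule.getKey());
                    break;
                }
            }
        }
        return matched;
    }

    /** Assignment part: metadata lives in a separate table, not in the grammar. */
    static Map<String, Set<String>> apply(Map<String, List<List<String>>> grammar) {
        Map<String, Set<String>> metadata = new LinkedHashMap<>();
        for (String symbol : leftRecursiveSymbols(grammar)) {
            metadata.computeIfAbsent(symbol, k -> new LinkedHashSet<>()).add("leftRecursive");
        }
        return metadata;
    }

    public static void main(String[] args) {
        Map<String, List<List<String>>> grammar = new LinkedHashMap<>();
        grammar.put("expr", Arrays.asList(
                Arrays.asList("expr", "'+'", "term"),
                Arrays.asList("term")));
        grammar.put("term", Arrays.asList(Arrays.asList("NUM")));
        System.out.println(apply(grammar));
    }
}
```

Because apply inspects only the structure of the rules, the same method can
be called on any grammar, which is the sense in which the aspect is reusable.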

3 A Case Study: ANTLR

One of the most popular Java-targeted parser generators today is ANTLR [4].
It is a mature tool based on the LL(*) recursive descent parsing algorithm,
which is empowered by syntactic predicates and backtracking. Many projects,
including Sun's NetBeans, use ANTLR to generate their parsers.

On the other hand, ANTLR has some weaknesses with respect to modularity and
reuse. The main issue is that it uses embedded semantic actions, which means
that the syntactic structure of the language is physically mixed with Java
code describing semantic actions. Thus ANTLR grammars look bloated and the
grammar structure is not clear. There are also some issues with grammar reuse
capabilities, though they are being resolved in newer versions (see [14]).

We want to use ANTLR's powerful features while working with modular grammar
definitions and keeping Java code clearly separated from the grammar.

The following sections describe how this can be done with Grammatic.

3.1 A Straightforward Solution 

A general way of achieving this with Grammatic was described in section 1:
we can identify ANTLR's extensions to EBNF, express them in Grammatic's
metadata and write a generator to convert annotated Grammatic definitions to
the ANTLR input language.

Now let us look at ANTLR's extensions to EBNF. For the sake of brevity we
focus on the most valuable of them here:

- Specifying rule parameters and return types. 

- Embedding semantic actions written in Java. 

- Specifying syntactic predicates. 

To express these extensions with metadata we define the following attributes 
to be used with grammar elements: 

- Rules:

• returns : ID; - return type.

• params : SEQUENCE of TUPLE(type : ID; name : ID); - parameters.

- Productions:

• predicate : STRING; - syntactic predicate for the production.

• before : STRING; - semantic action to be performed before the
production.

• after : STRING; - semantic action to be performed after the
production.

- Expressions:

• after : STRING; - semantic action to be performed after the
expression.

- Rule calls:

• arguments : SEQUENCE of ID; - arguments for the rule call.

We define semantic actions as simple strings, which is very close to how
ANTLR actually treats them.

For example, let us define an aspect which assigns ANTLR metadata to our
arithmetic expression grammar (see above). We want a parser which computes
the value of the expression being parsed. Thus our semantic actions will
perform arithmetic operations and return values of type int. Here is a sample
metadata assignment for the rule sum:

sum [[ returns = int; ]]
$: -> ..
[[
    before = '##result = 0;';
    #mult.after = <<
        ##result += #mult;
    >>;
]];

This assigns a 'returns' attribute to the symbol sum itself, a 'before'
semantic action to the production and an 'after' semantic action to all
occurrences of 'mult' on the right side. In the action bodies we have
##result and #mult, which correspond to the result variable of the rule and a
variable that denotes the value returned by mult. This semantics is to be
defined by the generator which converts our Grammatic definition into ANTLR's
language, because it depends only on how this generator treats metadata
assigned to grammar elements.
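
One way a generator could give meaning to these placeholders is a simple
textual expansion of the action body. The Java sketch below shows this
idea; it is an assumption about how such a generator might work, not a
description of the actual one. The class ActionExpansionSketch, the method
expand and the convention that each referenced symbol is mapped to an ANTLR
label (such as left) are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ActionExpansionSketch {
    // "##name" refers to a variable of the rule itself (e.g. ##result),
    // "#name" refers to the value returned by a referenced symbol.
    private static final Pattern REFERENCE = Pattern.compile("##(\\w+)|#(\\w+)");

    /**
     * Expands an action body into Java code for the ANTLR grammar.
     * labels maps a referenced symbol to the label chosen for its occurrence.
     */
    static String expand(String action, Map<String, String> labels) {
        Matcher m = REFERENCE.matcher(action);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement;
            if (m.group(1) != null) {
                replacement = m.group(1);                 // ##result -> result
            } else {
                replacement = labels.get(m.group(2));     // #mult -> its label
                if (replacement == null) {
                    throw new IllegalArgumentException("no label for #" + m.group(2));
                }
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> labels = new HashMap<>();
        labels.put("mult", "left");
        System.out.println(expand("##result = 0;", labels));       // result = 0;
        System.out.println(expand("##result += #mult;", labels));  // result += left;
    }
}
```

A real generator would choose a distinct label for every occurrence of mult
and emit the expanded action right after that occurrence.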

We can handle syntactic predicates the same way. To get the following 
ANTLR definition: 

NEWLINE
    : ('\r'? '\n')=> '\r'? '\n'
    | '\r'
    ;

We define a grammar rule:

NEWLINE : '\r'? '\n' || '\r' ;

And a metadata assignment rule:

NEWLINE
$: -> ..
[[
    predicate = <<'\r'? '\n'>>;
]]
-> '\r';

This metadata should also be properly treated by the generator. 

Other specific features of ANTLR (like grammar names, Java imports etc.) can
be expressed the same way. This method is general enough to be applied in all
cases we can imagine, and all it adds to the original tool is the separation
of concerns and reuse techniques available in Grammatic. However, the
generator may add some more value to the original tool. We show an example
below.

3.2 A More Sophisticated Solution

Nowadays a programming language is usually supported by a strong IDE which
makes common activities easier. For example, features brought by the Eclipse
IDE for Java [17] include syntax highlighting, code completion, semantic
highlighting, templates, refactorings and many more. That is why a developer
will not be glad to enter Java code outside a specialized editor, say, in an
ANTLR editor or a Grammatic editor. Although these "non-Java" editors may
provide basic features like highlighting and folding, they are unlikely to
provide refactorings and other complicated features. Therefore we want to
separate all the Java code from grammar definitions in such a way that it can
be edited separately - in a Java editor, using all of its power.

Some tools [8,6,4] solve this problem by generating a parser that builds an
AST which is to be processed by external code. This approach has the
following disadvantages: it consumes memory for storing the AST and time for
walking through it. There is also another drawback: many parsers for DSLs
simply build models which are very close to ASTs but slightly different (they
have cross-references, specific additional attributes etc.). In this case the
work done by an AST transformer (a program which converts an AST into a
model) is simply a waste of resources, since all the additional information
could be assigned during the parsing process. Thus we want to avoid building
such ASTs.

Instead of making a parser always build an AST we propose to use the
"Builder" design pattern [18]: a parser should only call methods of some
interfaces (builders) which are implemented outside it. Builder interfaces
abstract the semantic actions of the parser; they are generated along with
the parser's code. To illustrate this, let us have a look at our sum rule (in
ANTLR):

sum returns [int result]
    : {result = 0;}
      left=mult {result += left;}
      ('+' right=mult {result += right;})*
    ;

Semantic actions can be abstracted here like this: 

sum returns [int result]
@init {
    ISumBuilder builder = myBuilders.getSumBuilder();
}
    : {builder.init();}
      left=mult {builder.left(left);}
      ('+' right=mult {builder.right(right);})*
      {result = builder.getResult();}
    ;

This is more verbose than directly embedded actions, but it is generated; no
one has to write it by hand.
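
On the Java side, the builder used by this rule could look as follows. The
method names init, left, right and getResult are taken from the rule above;
the parameter types (int, since mult returns an int in this example), the
implementing class AddingSumBuilder and the helper parseSum, which merely
imitates the sequence of calls issued by the generated parser, are our
assumptions.

```java
/** Builder interface as it might be generated for the rule sum. */
interface ISumBuilder {
    void init();
    void left(int value);
    void right(int value);
    int getResult();
}

/** Hand-written implementation that can be edited in an ordinary Java editor. */
final class AddingSumBuilder implements ISumBuilder {
    private int result;

    @Override public void init() { result = 0; }
    @Override public void left(int value) { result += value; }
    @Override public void right(int value) { result += value; }
    @Override public int getResult() { return result; }
}

final class SumBuilderDemo {
    /**
     * Stand-in for the generated rule: performs the same builder calls the
     * parser would make for "operands[0] + operands[1] + ...".
     * Expects at least one operand.
     */
    static int parseSum(ISumBuilder builder, int... operands) {
        builder.init();
        builder.left(operands[0]);
        for (int i = 1; i < operands.length; i++) {
            builder.right(operands[i]);
        }
        return builder.getResult();
    }

    public static void main(String[] args) {
        System.out.println(parseSum(new AddingSumBuilder(), 2, 3, 4)); // 9
    }
}
```

Replacing AddingSumBuilder with another implementation changes the semantics
of the parser without touching the grammar or the generated code.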

This approach requires less memory and time since we do not need to build AST
objects (which requires memory proportional to the input length) and traverse
them. All we need is to create builder objects: only one object is needed for
each rule call that is present on the call stack at the same time, so the
memory consumption is proportional to the stack depth.

How can Grammatic help us? We are going to define metadata which gives a
generator enough information to generate builder interfaces and an ANTLR
grammar definition with embedded builder calls.

What metadata do we need to be able to generate builders along with an ANTLR
grammar? The following information is sufficient:

- Return values and parameters for each rule. 

- Arguments for each rule call. 

To give a more illustrative example (and create a more flexible system) we
will also allow several rules for each grammar symbol. This is useful because
otherwise we could have only one signature specification (parameters and
return value) for each syntactic rule, while the rule may have different
semantics when called in different contexts. For example (although a bit
contrived) we may distinguish constant expressions from ones containing
variables, since constant ones can be calculated at compile time. When writing
a compiler, we want constant expressions to be evaluated in place (the rule
must return a value) and variable expressions to be stored as objects
(expression trees). Hence, from one grammar rule

sum : mult ('+' mult)*;

we get two rules with different return types and parameters:

varSum [Scope scope] returns [Expression result]
    : varMult[scope] ('+' varMult[scope])* ;

constSum [Context context] returns [int result]
    : constMult[context] ('+' constMult[context])* ;

(We assume that Scope maps variable names to objects denoting variables and 
Context maps constant names to values.) 

We do not want to duplicate rules in our grammar for this purpose, so we will
express this in metadata for a single rule. Here is an example of how we can
do it:

sum
[[
    builders = {{
        Expression varSum(Scope scope);
        int constSum(Context context);
    }};
]]
-> mult ..
[[
    #mult.call = {
        varSum = {{varMult(scope)}};
        constSum = {{constMult(context)}};
    };
]];

What we see here is a small DSL inside Grammatic metadata (actually there are
two DSLs: one for specifying return types and parameters and another one for
specifying called rules and arguments). The 'builders' attribute of a symbol
defines the signatures (names, return types and parameters) of the ANTLR
rules generated for this symbol (two rules will be generated in this
example); the 'call' attribute of a symbol reference specifies the ANTLR rule
(with arguments) which should be called in each case.

The brevity of the metadata definition given above is achieved through
Grammatic's ability to define internal DSLs. This is done by parsing
attribute values of type SEQUENCE with an externally supplied parser. The
lexical structure of these DSLs is fixed: sequence elements (identifiers,
strings, numbers, tuples, sequences and punctuation values) serve as tokens.
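
As an illustration of what such an external parser has to do for the
'builders' attribute, the Java sketch below turns one signature into the
header of the corresponding ANTLR rule. It is a simplification under our own
assumptions: the class BuilderSignatureSketch and the method ruleHeader are
hypothetical, the signature is matched as text with a regular expression
instead of a token sequence, and the return variable is always named result,
as in the examples of this paper.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BuilderSignatureSketch {
    // <return type> <rule name> ( <parameters> ) ;
    private static final Pattern SIGNATURE =
            Pattern.compile("\\s*(\\w+)\\s+(\\w+)\\s*\\(([^)]*)\\)\\s*;\\s*");

    /** Converts e.g. "int constSum(Context context);" into an ANTLR rule header. */
    static String ruleHeader(String signature) {
        Matcher m = SIGNATURE.matcher(signature);
        if (!m.matches()) {
            throw new IllegalArgumentException("not a signature: " + signature);
        }
        String returnType = m.group(1);
        String ruleName = m.group(2);
        // Normalize whitespace inside the parameter list.
        String parameters = m.group(3).trim().replaceAll("\\s+", " ");
        return ruleName + " [" + parameters + "] returns [" + returnType + " result]";
    }

    public static void main(String[] args) {
        // Prints: varSum [Scope scope] returns [Expression result]
        System.out.println(ruleHeader("Expression varSum(Scope scope);"));
        // Prints: constSum [Context context] returns [int result]
        System.out.println(ruleHeader("int constSum(Context context);"));
    }
}
```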

A generator will produce the two rules given above from this example. Above
we omitted builder calls from the rule definitions for clarity. A full rule
looks like this:

varSum [Scope scope] returns [Expression result]
@init {
    IVarSumBuilder builder = myBuilders.getVarSumBuilder(scope);
}
    : vm=varMult[scope] {builder.varMult(vm);}
      ('+' vm1=varMult[scope] {builder.varMult(vm1);})* ;

4 Conclusion and Future Work 

This paper addressed the problems of modularity and reuse of grammar
definitions by defining a general front end, Grammatic. This front end can be
adopted by newly developed tools through its API, or it can be attached to an
existing tool by creating a converter from its universal format to the tool's
specific input format. Grammatic provides a language for defining modular
grammars (supporting templates and imports), which is extensible by attaching
arbitrary metadata. It also supports separation of concerns by defining
reusable aspects.

We showed two ways of using Grammatic to bring reuse and separation of
concerns to a popular parser generator - ANTLR. The most straightforward way
is based on expressing the extensions added by the tool to a general grammar
definition language in terms of metadata and creating a generator which
transforms an annotated Grammatic definition into the tool's input. We also
presented a way of using Grammatic to separate custom code written in the
target language (Java) from the grammar definition and other metadata. This
is done by using the "Builder" design pattern: generating a set of interfaces
which are to be implemented manually. This allows a developer to use the
power of an IDE when working with Java code.

The case study shows that Grammatic helps to add reuse and modularity
capabilities to existing tools. We plan to apply such practices to more tools
to find out what else should be supported by Grammatic. For now we plan to
support metadata templates, grammar testing facilities, text generation and
error tracking facilities that help to convert errors reported by a back end
into Grammatic's errors.

Our long term goal is to create a common grammar definition platform usable 
by a wide range of grammarware engineering tools. 